Big Data - Homework1 - Ngoc 131112(Python)

Loading...
 
customerID
gender
SeniorCitizen
Partner
Dependents
tenure
PhoneService
MultipleLines
InternetService
OnlineSecurity
OnlineBackup
DeviceProtection
TechSupport
StreamingTV
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
7590-VHVEG
Female
0
Yes
No
1
No
No phone service
DSL
No
Yes
No
No
No
5575-GNVDE
Male
0
No
No
34
Yes
No
DSL
Yes
No
Yes
No
No
3668-QPYBK
Male
0
No
No
2
Yes
No
DSL
Yes
Yes
No
No
No
7795-CFOCW
Male
0
No
No
45
No
No phone service
DSL
Yes
No
Yes
Yes
No
9237-HQITU
Female
0
No
No
2
Yes
No
Fiber optic
No
No
No
No
No
9305-CDSKC
Female
0
No
No
8
Yes
Yes
Fiber optic
No
No
Yes
No
Yes
1452-KIOVK
Male
0
No
Yes
22
Yes
Yes
Fiber optic
No
Yes
No
No
Yes
6713-OKOMC
Female
0
No
No
10
No
No phone service
DSL
Yes
No
No
No
No
7892-POOKP
Female
0
Yes
No
28
Yes
Yes
Fiber optic
No
No
Yes
Yes
Yes
6388-TABGU
Male
0
No
Yes
62
Yes
No
DSL
Yes
Yes
No
No
No
9763-GRSKD
Male
0
Yes
Yes
13
Yes
No
DSL
Yes
No
No
No
No
7469-LKBCI
Male
0
No
No
16
Yes
No
No
No internet service
No internet service
No internet service
No internet service
No internet service
8091-TTVAX
Male
0
Yes
No
58
Yes
Yes
Fiber optic
No
No
Yes
No
Yes
0280-XJGEX
Male
0
No
No
49
Yes
Yes
Fiber optic
No
Yes
Yes
No
Yes
5129-JLPIS
Male
0
No
No
25
Yes
No
Fiber optic
Yes
No
Yes
Yes
Yes
3655-SNQYZ
Female
0
Yes
Yes
69
Yes
Yes
Fiber optic
Yes
Yes
Yes
Yes
Yes
8191-XWSZG
Female
0
No
No
52
Yes
No
No
No internet service
No internet service
No internet service
No internet service
No internet service
7,043 rows
 
customerID
gender
SeniorCitizen
Partner
Dependents
tenure
PhoneService
MultipleLines
InternetService
OnlineSecurity
OnlineBackup
DeviceProtection
TechSupport
StreamingTV
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
7590-VHVEG
Female
0
Yes
No
1
No
No phone service
DSL
No
Yes
No
No
No
5575-GNVDE
Male
0
No
No
34
Yes
No
DSL
Yes
No
Yes
No
No
3668-QPYBK
Male
0
No
No
2
Yes
No
DSL
Yes
Yes
No
No
No
7795-CFOCW
Male
0
No
No
45
No
No phone service
DSL
Yes
No
Yes
Yes
No
9237-HQITU
Female
0
No
No
2
Yes
No
Fiber optic
No
No
No
No
No
9305-CDSKC
Female
0
No
No
8
Yes
Yes
Fiber optic
No
No
Yes
No
Yes
1452-KIOVK
Male
0
No
Yes
22
Yes
Yes
Fiber optic
No
Yes
No
No
Yes
6713-OKOMC
Female
0
No
No
10
No
No phone service
DSL
Yes
No
No
No
No
7892-POOKP
Female
0
Yes
No
28
Yes
Yes
Fiber optic
No
No
Yes
Yes
Yes
6388-TABGU
Male
0
No
Yes
62
Yes
No
DSL
Yes
Yes
No
No
No
9763-GRSKD
Male
0
Yes
Yes
13
Yes
No
DSL
Yes
No
No
No
No
7469-LKBCI
Male
0
No
No
16
Yes
No
No
No internet service
No internet service
No internet service
No internet service
No internet service
8091-TTVAX
Male
0
Yes
No
58
Yes
Yes
Fiber optic
No
No
Yes
No
Yes
0280-XJGEX
Male
0
No
No
49
Yes
Yes
Fiber optic
No
Yes
Yes
No
Yes
5129-JLPIS
Male
0
No
No
25
Yes
No
Fiber optic
Yes
No
Yes
Yes
Yes
3655-SNQYZ
Female
0
Yes
Yes
69
Yes
Yes
Fiber optic
Yes
Yes
Yes
Yes
Yes
8191-XWSZG
Female
0
No
No
52
Yes
No
No
No internet service
No internet service
No internet service
No internet service
No internet service
7,043 rows
#Create a dataframe containing 'customerID' and 'MonthlyCharges'
df_f = df.select(['customerID', 'MonthlyCharges'])
display(df_f)
 
customerID
MonthlyCharges
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
7590-VHVEG
29.85
5575-GNVDE
56.95
3668-QPYBK
53.85
7795-CFOCW
42.3
9237-HQITU
70.7
9305-CDSKC
99.65
1452-KIOVK
89.1
6713-OKOMC
29.75
7892-POOKP
104.8
6388-TABGU
56.15
9763-GRSKD
49.95
7469-LKBCI
18.95
8091-TTVAX
100.35
0280-XJGEX
103.7
5129-JLPIS
105.5
3655-SNQYZ
113.25
8191-XWSZG
20.65
7,043 rows
# Get the number of customers that have Multiple Lines. Show the result as value of a single variable (2 points)
multiple_lines_count = df[df['MultipleLines']=='Yes'].count()
display(multiple_lines_count)

2971
# What are the minimum and maximum values of tenure? Show these as single variables. (2 points)
import pyspark.sql.functions as f
df_min_max = df.agg(f.min(col('Tenure')).alias('Min Tenure'),f.max(col('Tenure')).alias('Max Tenure'))
display(df_min_max)
 
Min Tenure
Max Tenure
1
0
72
1 row
# What are the monthly charges depending on the gender of the customer?

monthly_charges_by_gender =df.groupBy(df['gender']).agg({'MonthlyCharges': 'avg'})
display(monthly_charges_by_gender)
 
gender
avg(MonthlyCharges)
1
2
Female
65.20424311926602
Male
64.32748241912773
2 rows
# Do customers who have a yearly contract pay on average more than the others? Return the result as a single bool variable. 
# Please note that others in this context means that we treat together both the Two year and month-to-month customers. (2 points)

from pyspark.sql.functions import col

yearly_avg = df.filter(col('contract') == 'One year').agg({'MonthlyCharges': 'avg'}).collect()[0][0]
others_avg = df.filter((col('contract') == 'Two year') | (col('contract') == 'Month-to-month')).agg({'MonthlyCharges': 'avg'}).collect()[0][0]
yearly_pays_more = yearly_avg > others_avg
print(yearly_pays_more)
True
#Create a column that will show the average charges computed as a division of Total Charges by tenure
df_e=df.withColumn('Average charges',f.round(col('TotalCharges')/col('Tenure'),2))
display(df_e)
 
customerID
gender
SeniorCitizen
Partner
Dependents
tenure
PhoneService
MultipleLines
InternetService
OnlineSecurity
OnlineBackup
DeviceProtection
TechSupport
StreamingTV
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
7590-VHVEG
Female
0
Yes
No
1
No
No phone service
DSL
No
Yes
No
No
No
5575-GNVDE
Male
0
No
No
34
Yes
No
DSL
Yes
No
Yes
No
No
3668-QPYBK
Male
0
No
No
2
Yes
No
DSL
Yes
Yes
No
No
No
7795-CFOCW
Male
0
No
No
45
No
No phone service
DSL
Yes
No
Yes
Yes
No
9237-HQITU
Female
0
No
No
2
Yes
No
Fiber optic
No
No
No
No
No
9305-CDSKC
Female
0
No
No
8
Yes
Yes
Fiber optic
No
No
Yes
No
Yes
1452-KIOVK
Male
0
No
Yes
22
Yes
Yes
Fiber optic
No
Yes
No
No
Yes
6713-OKOMC
Female
0
No
No
10
No
No phone service
DSL
Yes
No
No
No
No
7892-POOKP
Female
0
Yes
No
28
Yes
Yes
Fiber optic
No
No
Yes
Yes
Yes
6388-TABGU
Male
0
No
Yes
62
Yes
No
DSL
Yes
Yes
No
No
No
9763-GRSKD
Male
0
Yes
Yes
13
Yes
No
DSL
Yes
No
No
No
No
7469-LKBCI
Male
0
No
No
16
Yes
No
No
No internet service
No internet service
No internet service
No internet service
No internet service
8091-TTVAX
Male
0
Yes
No
58
Yes
Yes
Fiber optic
No
No
Yes
No
Yes
0280-XJGEX
Male
0
No
No
49
Yes
Yes
Fiber optic
No
Yes
Yes
No
Yes
5129-JLPIS
Male
0
No
No
25
Yes
No
Fiber optic
Yes
No
Yes
Yes
Yes
3655-SNQYZ
Female
0
Yes
Yes
69
Yes
Yes
Fiber optic
Yes
Yes
Yes
Yes
Yes
8191-XWSZG
Female
0
No
No
52
Yes
No
No
No internet service
No internet service
No internet service
No internet service
No internet service
7,043 rows
Beta
0 / 10used queries